Introduction

Overview and Motivation

Our inspiration comes from the Kaggle competition Instacart Market Basket Analysis, which is also the source of our data sets. Instacart is a grocery ordering and delivery application. It provides an anonymized dataset containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, the dataset includes between 4 and 100 orders, with the sequence of products purchased in each order, the day of week and hour of day each order was placed, and a relative measure of time between orders (details of each data set are introduced below).

Instacart asks competition participants to test models that predict which products a user will buy again, try for the first time, or add to their cart next during a session, which may call for models such as XGBoost, word2vec, and Annoy.

Predicting repurchases and order placement days are popular and useful tasks among e-commerce companies. For example, Amazon has already patented a technique called "anticipatory shipping" that predicts what and when people want to buy, so that packages can be shipped even before customers place an order. This lets them greatly optimize logistics management, human and equipment resources, and inventory arrangement, which decreases cost and increases profit. Meanwhile, this type of prediction requires much more information about customer behavior, such as the items customers have searched for, the amount of time a user's cursor hovers over a product, the number of clicks, the purchase conversion rates of clicks and add-to-cart actions, and so on.

Given this limitation of information, and since we would like to apply the models we have learned in the course, we choose to predict the day of the week on which an order will be placed. This prediction can then serve as an additional predictor to support demand forecasting, which helps the e-commerce platform take the right direction in decision-making processes such as inventory arrangement.

Research questions

Overall, we build a new dataset from the files downloaded from the competition website, under the following assumptions:

  1. one order = one user (due to the data limitation mentioned above);
  2. we already know what customers will buy next time, i.e. the demand is known.

Thus, our research questions will be:

  • On what day of the week will a given order be placed?
    For this question, we will use supervised methods - Classification Tree and Multiple Logistic Regression.

  • Are there any common components between departments or aisles?
    For this question, we will use unsupervised methods - PCA and Clustering.

Exploratory Data Analysis

Data Description

orders (3.4m rows, 206k users):
* order_id: order identifier
* user_id: customer identifier
* eval_set: which evaluation set this order belongs in (see SET described below)
* order_number: the order sequence number for this user (1 = first, n = nth)
* order_dow: the day of the week the order was placed on
* order_hour_of_day: the hour of the day the order was placed on
* days_since_prior: days since the last order, capped at 30 (with NAs for order_number = 1)

products (50k rows):
* product_id: product identifier
* product_name: name of the product
* aisle_id: foreign key
* department_id: foreign key

aisles (134 rows):
* aisle_id: aisle identifier
* aisle: the name of the aisle

departments (21 rows):
* department_id: department identifier
* department: the name of the department

order_products__SET (30m+ rows):
* order_id: foreign key
* product_id: foreign key
* add_to_cart_order: order in which each product was added to cart
* reordered: 1 if this product has been ordered by this user in the past, 0 otherwise

where SET is one of the three following evaluation sets (eval_set in orders):
* "prior": orders prior to that user's most recent order (~3.2m orders)
* "train": training data supplied to participants (~131k orders)
* "test": test data reserved for machine learning competitions (~75k orders)
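To make the relationships concrete, the tables link through their foreign keys: order_products__SET points into orders and products, and products points into aisles and departments. A minimal sketch of the joins (in Python with pandas, on toy rows rather than the real data; the report's own analysis is in R):

```python
import pandas as pd

# Toy rows mirroring the schema above (not the real Instacart data).
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [10, 10],
                       "order_dow": [0, 4], "order_hour_of_day": [8, 11]})
products = pd.DataFrame({"product_id": [100, 101],
                         "product_name": ["Milk", "Yogurt"],
                         "aisle_id": [84, 120], "department_id": [16, 16]})
aisles = pd.DataFrame({"aisle_id": [84, 120], "aisle": ["milk", "yogurt"]})
departments = pd.DataFrame({"department_id": [16], "department": ["dairy eggs"]})
order_products = pd.DataFrame({"order_id": [1, 1, 2],
                               "product_id": [100, 101, 100],
                               "add_to_cart_order": [1, 2, 1],
                               "reordered": [0, 0, 1]})

# Follow the foreign keys: line items -> products -> aisles/departments -> orders.
lines = (order_products
         .merge(products, on="product_id")
         .merge(aisles, on="aisle_id")
         .merge(departments, on="department_id")
         .merge(orders, on="order_id"))
print(lines[["order_id", "product_name", "aisle", "department", "order_dow"]])
```
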

Table 1 - aisles
The aisles table
aisle_id aisle
1 prepared soups salads
2 specialty cheeses
3 energy granola bars
4 instant foods
5 marinades meat preparation
6 other
7 packaged meat
8 bakery desserts
9 pasta sauce
10 kitchen supplies
11 cold flu allergy
12 fresh pasta
13 prepared meals
14 tofu meat alternatives
15 packaged seafood
16 fresh herbs
17 baking ingredients
18 bulk dried fruits vegetables
19 oils vinegars
20 oral hygiene
21 packaged cheese
22 hair care
23 popcorn jerky
24 fresh fruits
25 soap
26 coffee
27 beers coolers
28 red wines
29 honeys syrups nectars
30 latino foods
31 refrigerated
32 packaged produce
33 kosher foods
34 frozen meat seafood
35 poultry counter
36 butter
37 ice cream ice
38 frozen meals
39 seafood counter
40 dog food care
41 cat food care
42 frozen vegan vegetarian
43 buns rolls
44 eye ear care
45 candy chocolate
46 mint gum
47 vitamins supplements
48 breakfast bars pastries
49 packaged poultry
50 fruit vegetable snacks
51 preserved dips spreads
52 frozen breakfast
53 cream
54 paper goods
55 shave needs
56 diapers wipes
57 granola
58 frozen breads doughs
59 canned meals beans
60 trash bags liners
61 cookies cakes
62 white wines
63 grains rice dried goods
64 energy sports drinks
65 protein meal replacements
66 asian foods
67 fresh dips tapenades
68 bulk grains rice dried goods
69 soup broth bouillon
70 digestion
71 refrigerated pudding desserts
72 condiments
73 facial care
74 dish detergents
75 laundry
76 indian foods
77 soft drinks
78 crackers
79 frozen pizza
80 deodorants
81 canned jarred vegetables
82 baby accessories
83 fresh vegetables
84 milk
85 food storage
86 eggs
87 more household
88 spreads
89 salad dressing toppings
90 cocoa drink mixes
91 soy lactosefree
92 baby food formula
93 breakfast bakery
94 tea
95 canned meat seafood
96 lunch meat
97 baking supplies decor
98 juice nectars
99 canned fruit applesauce
100 missing
101 air fresheners candles
102 baby bath body care
103 ice cream toppings
104 spices seasonings
105 doughs gelatins bake mixes
106 hot dogs bacon sausage
107 chips pretzels
108 other creams cheeses
109 skin care
110 pickled goods olives
111 plates bowls cups flatware
112 bread
113 frozen juice
114 cleaning products
115 water seltzer sparkling water
116 frozen produce
117 nuts seeds dried fruit
118 first aid
119 frozen dessert
120 yogurt
121 cereal
122 meat counter
123 packaged vegetables fruits
124 spirits
125 trail mix snack mix
126 feminine care
127 body lotions soap
128 tortillas flat bread
129 frozen appetizers sides
130 hot cereal pancake mixes
131 dry pasta
132 beauty
133 muscles joints pain relief
134 specialty wines champagnes
Table 2 - departments
The departments table
department_id department
1 frozen
2 other
3 bakery
4 produce
5 alcohol
6 international
7 beverages
8 pets
9 dry goods pasta
10 bulk
11 personal care
12 meat seafood
13 pantry
14 breakfast
15 canned goods
16 dairy eggs
17 household
18 babies
19 snacks
20 deli
21 missing
Table 3 - products
The products table
product_id product_name aisle_id department_id
1 Chocolate Sandwich Cookies 61 19
2 All-Seasons Salt 104 13
3 Robust Golden Unsweetened Oolong Tea 94 7
4 Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce 38 1
5 Green Chile Anytime Sauce 5 13
6 Dry Nose Oil 11 11
7 Pure Coconut Water With Orange 98 7
8 Cut Russet Potatoes Steam N’ Mash 116 1
9 Light Strawberry Blueberry Yogurt 120 16
10 Sparkling Orange Juice & Prickly Pear Beverage 115 7
11 Peach Mango Juice 31 7
12 Chocolate Fudge Layer Cake 119 1
13 Saline Nasal Mist 11 11
14 Fresh Scent Dishwasher Cleaner 74 17
15 Overnight Diapers Size 6 56 18
16 Mint Chocolate Flavored Syrup 103 19
17 Rendered Duck Fat 35 12
18 Pizza for One Suprema Frozen Pizza 79 1
19 Gluten Free Quinoa Three Cheese & Mushroom Blend 63 9
20 Pomegranate Cranberry & Aloe Vera Enrich Drink 98 7
21 Small & Medium Dental Dog Treats 40 8
22 Fresh Breath Oral Rinse Mild Mint 20 11
23 Organic Turkey Burgers 49 12
24 Tri-Vi-Sol® Vitamins A-C-and D Supplement Drops for Infants 47 11
25 Salted Caramel Lean Protein & Fiber Bar 3 19
26 Fancy Feast Trout Feast Flaked Wet Cat Food 41 8
27 Complete Spring Water Foaming Antibacterial Hand Wash 127 11
28 Wheat Chex Cereal 121 14
29 Fresh Cut Golden Sweet No Salt Added Whole Kernel Corn 81 15
30 Three Cheese Ziti, Marinara with Meatballs 38 1
31 White Pearl Onions 123 4
32 Nacho Cheese White Bean Chips 107 19
33 Organic Spaghetti Style Pasta 131 9
34 Peanut Butter Cereal 121 14
35 Italian Herb Porcini Mushrooms Chicken Sausage 106 12
36 Traditional Lasagna with Meat Sauce Savory Italian Recipes 38 1
37 Noodle Soup Mix With Chicken Broth 69 15
38 Ultra Antibacterial Dish Liquid 100 21
39 Daily Tangerine Citrus Flavored Beverage 64 7
40 Beef Hot Links Beef Smoked Sausage With Chile Peppers 106 12
41 Organic Sourdough Einkorn Crackers Rosemary 78 19
42 Biotin 1000 mcg 47 11
43 Organic Clementines 123 4
44 Sparkling Raspberry Seltzer 115 7
45 European Cucumber 83 4
46 Raisin Cinnamon Bagels 5 count 58 1
47 Onion Flavor Organic Roasted Seaweed Snack 66 6
48 School Glue, Washable, No Run 87 17
49 Vegetarian Grain Meat Sausages Italian - 4 CT 14 20
50 Pumpkin Muffin Mix 105 13
Table 4 - order_products_train
The order_products_train table
order_id product_id add_to_cart_order reordered
1 49302 1 1
1 11109 2 1
1 10246 3 0
1 49683 4 0
1 43633 5 1
1 13176 6 0
1 47209 7 0
1 22035 8 1
36 39612 1 0
36 19660 2 1
36 49235 3 0
36 43086 4 1
36 46620 5 1
36 34497 6 1
36 48679 7 1
36 46979 8 1
38 11913 1 0
38 18159 2 0
38 4461 3 0
38 21616 4 1
38 23622 5 0
38 32433 6 0
38 28842 7 0
38 42625 8 0
38 39693 9 0
96 20574 1 1
96 30391 2 0
96 40706 3 1
96 25610 4 0
96 27966 5 1
96 24489 6 1
96 39275 7 1
98 8859 1 1
98 19731 2 1
98 43654 3 1
98 13176 4 1
98 4357 5 1
98 37664 6 1
98 34065 7 1
98 35951 8 1
98 43560 9 1
98 9896 10 1
98 27509 11 1
98 15455 12 1
98 27966 13 1
98 47601 14 1
98 40396 15 1
98 35042 16 1
98 40986 17 1
98 1939 18 1
Table 5 - purchase time per order table
The purchase time per order table
order_id order_dow order_hour_of_day
1187899 4 8
1492625 1 11
2196797 0 11
525192 2 11
880375 1 14
1094988 6 10
1822501 0 19
1827621 0 21
2316178 2 19
2180313 3 10
2461523 6 9
1854765 1 12
3402036 1 12
965160 0 16
2614670 5 14
3110252 4 11
62370 2 13
698604 4 13
1524161 0 13
3173750 0 9
2032076 0 20
2803975 0 11
1864787 5 11
2436259 0 12
1947848 4 20
2906490 4 22
2924697 5 18
519514 4 12
1750084 3 9
1647290 4 16
3088145 2 10
39325 2 18
13318 1 9
1651215 0 12
1019719 2 12
2989905 6 8
2639013 0 13
1072954 6 17
34647 3 19
2757217 0 11
669729 5 12
3038639 5 13
2608424 2 14
482516 4 7
3294399 4 8
1700658 6 11
21708 0 6
2178718 2 8
1734166 5 18
859654 1 10

We can observe in the left chart (order_dow) that the most frequent ordering days are Sunday and Monday compared with the rest of the week, and in the right chart (order_hour_of_day) we note a high volume of orders between 9 a.m. and 6 p.m.
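The two distributions behind these charts are plain group-by counts over the orders table; a minimal sketch (Python/pandas on toy rows, with the column names from the data description above):

```python
import pandas as pd

# Toy orders frame; the real one has ~3.4m rows.
orders = pd.DataFrame({
    "order_dow":         [0, 0, 1, 1, 1, 3, 6],
    "order_hour_of_day": [9, 10, 10, 14, 15, 11, 17],
})

# Counts per day of week (0 = Sunday in this dataset) and per hour of day.
by_dow = orders["order_dow"].value_counts().sort_index()
by_hour = orders["order_hour_of_day"].value_counts().sort_index()
print(by_dow)
print(by_hour)
```
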

Table 6 - user_purchases
The user purchases table
order_id order_dow order_hour_of_day aisle_id aisle department_id department
1187899 4 8 77 soft drinks 7 beverages
1187899 4 8 21 packaged cheese 16 dairy eggs
1187899 4 8 120 yogurt 16 dairy eggs
1187899 4 8 54 paper goods 17 household
1187899 4 8 45 candy chocolate 19 snacks
1187899 4 8 117 nuts seeds dried fruit 19 snacks
1187899 4 8 121 cereal 14 breakfast
1187899 4 8 23 popcorn jerky 19 snacks
1187899 4 8 84 milk 16 dairy eggs
1187899 4 8 53 cream 16 dairy eggs
1187899 4 8 77 soft drinks 7 beverages
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 58 frozen breads doughs 1 frozen
1492625 1 11 107 chips pretzels 19 snacks
1492625 1 11 23 popcorn jerky 19 snacks
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 24 fresh fruits 4 produce
1492625 1 11 91 soy lactosefree 16 dairy eggs
1492625 1 11 46 mint gum 19 snacks
1492625 1 11 96 lunch meat 20 deli
1492625 1 11 80 deodorants 11 personal care
1492625 1 11 1 prepared soups salads 20 deli
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 38 frozen meals 1 frozen
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 37 ice cream ice 1 frozen
1492625 1 11 117 nuts seeds dried fruit 19 snacks
1492625 1 11 3 energy granola bars 19 snacks
1492625 1 11 69 soup broth bouillon 15 canned goods
1492625 1 11 69 soup broth bouillon 15 canned goods
2196797 0 11 29 honeys syrups nectars 13 pantry
2196797 0 11 24 fresh fruits 4 produce
2196797 0 11 21 packaged cheese 16 dairy eggs
2196797 0 11 66 asian foods 6 international
2196797 0 11 101 air fresheners candles 17 household
2196797 0 11 83 fresh vegetables 4 produce
2196797 0 11 66 asian foods 6 international
2196797 0 11 123 packaged vegetables fruits 4 produce

Visualization

Top 10 aisles by number of purchases

The top 10 aisles by number of purchases
aisle department total_order
fresh vegetables produce 150609
fresh fruits produce 150473
packaged vegetables fruits produce 78493
yogurt dairy eggs 55240
packaged cheese dairy eggs 41699
water seltzer sparkling water beverages 36617
milk dairy eggs 32644
chips pretzels snacks 31269
soy lactosefree dairy eggs 26240
bread bakery 23635

Top 10 departments by number of purchases

The top 10 departments by number of purchases
department total_order
produce 409087
dairy eggs 217051
snacks 118862
beverages 114046
frozen 100426
pantry 81242
bakery 48394
canned goods 46799
deli 44291
dry goods pasta 38713

Sales Patterns
Here, we would like to observe the sales patterns in more depth by splitting the data into departments. First, we look at the pattern of weekly sales.


From these graphs, we can observe the following patterns:

  1. Although the graph shown at the beginning illustrates that the purchase peak usually falls on Sunday and Monday, alcohol is the exception here. For alcohol, the figure increases slightly from its trough on Monday, reaches its top on Friday, and then decreases sharply on Saturday.

  2. The other departments show a similar pattern: the figures decrease from their peak on Sunday and then start increasing again on Friday.

PCA

We analyze the association between the numbers of orders from different departments.

PCA by department

PCA summarizes the shared variation among variables. It can be computed on either of two matrices: the correlation matrix (scaled data) or the covariance matrix (non-scaled data). In our analysis, we focus on the relationship between the number of orders from each department and the day of week on which users purchase, so we base our main PCA on the non-scaled data, i.e. on the covariance matrix. It is nevertheless interesting to see how the scaled and non-scaled results differ, so we also perform the PCA on the correlation matrix.
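The difference between the two choices can be sketched numerically: on the covariance matrix, high-count departments dominate the first component, while on the correlation matrix every department contributes equally. A toy illustration (Python/NumPy; the simulated counts are purely hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy matrix: rows = observations, columns = order counts for three departments
# with very different scales (means 20, 5, 2).
X = rng.poisson(lam=[20, 5, 2], size=(500, 3)).astype(float)

def pca_eigenvalues(X, scale):
    Xc = X - X.mean(axis=0)
    if scale:                       # correlation PCA: standardize each column
        Xc = Xc / Xc.std(axis=0)
    cov = np.cov(Xc, rowvar=False)  # covariance of the (possibly scaled) data
    vals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    return vals / vals.sum() * 100  # percentage of variance per component

# The unscaled first component is dominated by the high-variance column.
print("covariance PCA:", pca_eigenvalues(X, scale=False).round(1))
print("correlation PCA:", pca_eigenvalues(X, scale=True).round(1))
```
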

Non-scaled PCA (Covariance)

We observe that the first and second components explain 46.69% and 13.76% of the variance of the data. Following the rule of thumb of selecting enough dimensions to explain at least 75% of the variation, we keep components 1 to 5, which together explain around 79.8% of the variance.

Our findings:

  1. Produce has the highest variation; it is highly positively correlated with Dim1 and negatively correlated with Dim2.

  2. The other departments with the second- to sixth-largest variances (dairy eggs, snacks, frozen, beverages, and pantry) are positively correlated with both Dim1 and Dim2.

#>         eigenvalue percentage of variance
#> comp 1     12.6419                46.6895
#> comp 2      3.7269                13.7642
#> comp 3      2.1298                 7.8658
#> comp 4      1.5768                 5.8236
#> comp 5      1.5351                 5.6694
#> comp 6      1.1152                 4.1186
#> comp 7      0.6467                 2.3883
#> comp 8      0.6098                 2.2523
#> comp 9      0.5149                 1.9017
#> comp 10     0.4686                 1.7308
#> comp 11     0.4194                 1.5490
#> comp 12     0.3796                 1.4018
#> comp 13     0.3128                 1.1552
#> comp 14     0.2797                 1.0329
#> comp 15     0.2621                 0.9681
#> comp 16     0.1236                 0.4563
#> comp 17     0.1212                 0.4476
#> comp 18     0.1040                 0.3840
#> comp 19     0.0833                 0.3076
#> comp 20     0.0146                 0.0538
#> comp 21     0.0107                 0.0397
#>         cumulative percentage of variance
#> comp 1                               46.7
#> comp 2                               60.5
#> comp 3                               68.3
#> comp 4                               74.1
#> comp 5                               79.8
#> comp 6                               83.9
#> comp 7                               86.3
#> comp 8                               88.6
#> comp 9                               90.5
#> comp 10                              92.2
#> comp 11                              93.8
#> comp 12                              95.2
#> comp 13                              96.3
#> comp 14                              97.3
#> comp 15                              98.3
#> comp 16                              98.8
#> comp 17                              99.2
#> comp 18                              99.6
#> comp 19                              99.9
#> comp 20                             100.0
#> comp 21                             100.0
#>                    Dim.1     Dim.2    Dim.3     Dim.4     Dim.5
#> canned goods     0.22151  1.20e-01  0.00149  0.122510 -0.000978
#> dairy eggs       0.89270  1.35e+00 -0.89979 -0.249501  0.003429
#> produce          3.38519 -5.46e-01  0.12652 -0.018243  0.027935
#> beverages        0.08314  5.02e-01  0.50253 -0.087091  1.104763
#> deli             0.16873  1.60e-01  0.03477  0.037581 -0.022161
#> frozen           0.26363  5.67e-01  0.19967  1.133036 -0.108314
#> pantry           0.27934  2.71e-01  0.02251  0.149302 -0.010888
#> snacks           0.27092  8.93e-01  1.00108 -0.410478 -0.539367
#> bakery           0.15404  2.03e-01 -0.00956  0.036576 -0.005598
#> household       -0.01834  1.16e-01  0.06346  0.031781  0.091937
#> meat seafood     0.12813  6.36e-02 -0.01244  0.038850 -0.002224
#> personal care   -0.00385  5.93e-02  0.03544  0.020033  0.034508
#> dry goods pasta  0.16510  1.59e-01 -0.00625  0.103859 -0.016606
#> babies           0.05536  6.88e-02 -0.02009  0.008534 -0.014092
#> missing          0.03241  2.57e-02  0.00731  0.008221  0.003323
#> other            0.00254  3.32e-03  0.00223  0.001340  0.001498
#> breakfast        0.07014  1.66e-01  0.04165  0.002389 -0.015789
#> international    0.05296  2.64e-02  0.00637  0.020592 -0.002705
#> alcohol         -0.02207  1.25e-03  0.00601 -0.000461  0.005266
#> bulk             0.00743  7.95e-05  0.00184 -0.001773 -0.001129
#> pets            -0.00531  1.53e-02  0.00650  0.008148  0.008623

Scaled PCA (Correlation)

We find that the first and second components explain only 13.6% and 6.6% of the variance respectively, and we need 14 components (out of 21) to pass 75% of the variation. This means that the correlations between departments are very low, so PCA cannot usefully reduce the dimensionality of the scaled data.

#>         eigenvalue percentage of variance
#> comp 1       2.861                  13.62
#> comp 2       1.382                   6.58
#> comp 3       1.167                   5.55
#> comp 4       1.049                   4.99
#> comp 5       1.035                   4.93
#> comp 6       1.008                   4.80
#> comp 7       0.990                   4.71
#> comp 8       0.972                   4.63
#> comp 9       0.944                   4.49
#> comp 10      0.931                   4.43
#> comp 11      0.903                   4.30
#> comp 12      0.874                   4.16
#> comp 13      0.871                   4.15
#> comp 14      0.839                   4.00
#> comp 15      0.807                   3.84
#> comp 16      0.791                   3.77
#> comp 17      0.772                   3.67
#> comp 18      0.760                   3.62
#> comp 19      0.736                   3.51
#> comp 20      0.714                   3.40
#> comp 21      0.595                   2.83
#>         cumulative percentage of variance
#> comp 1                               13.6
#> comp 2                               20.2
#> comp 3                               25.8
#> comp 4                               30.8
#> comp 5                               35.7
#> comp 6                               40.5
#> comp 7                               45.2
#> comp 8                               49.8
#> comp 9                               54.3
#> comp 10                              58.8
#> comp 11                              63.1
#> comp 12                              67.2
#> comp 13                              71.4
#> comp 14                              75.4
#> comp 15                              79.2
#> comp 16                              83.0
#> comp 17                              86.6
#> comp 18                              90.3
#> comp 19                              93.8
#> comp 20                              97.2
#> comp 21                             100.0

Supervised Learning

Data Preparation for Models

Before applying the models to the data, we aggregate the id_orders column by department, so that for each order we know the number of products purchased from each department. In addition, we keep the column order_dow to identify on which day of the week an order was placed.

After creating this new table, we convert the column order_dow from numeric (int) to categorical (factor), and to make the values easier to read, we replace the integer codes with day names: "0" becomes "Sunday", "1" "Monday", "2" "Tuesday", and so on.

Moreover, we split the new table in two, to guard against overfitting and to check the quality of the predictions. The training set holds 80% of the observations, sampled at random (around 105k observations); the remaining observations form the test set (around 26k observations).
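The three preparation steps above (aggregate per order and department, recode order_dow to day names, random 80/20 split) can be sketched as follows (Python/pandas on toy rows; the report's own preparation is done in R):

```python
import pandas as pd

# Toy order-level data: one row per (order, department) line item.
items = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3],
    "department": ["produce", "produce", "frozen", "dairy eggs", "produce", "frozen"],
    "order_dow":  [0, 0, 0, 1, 1, 4],
})

# One row per order: product counts per department, plus the day of week.
wide = (items.pivot_table(index=["order_id", "order_dow"],
                          columns="department", aggfunc="size", fill_value=0)
        .reset_index())

# Recode 0..6 into day names (0 = Sunday in this dataset) as a categorical target.
days = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]
wide["order_dow"] = pd.Categorical(wide["order_dow"].map(dict(enumerate(days))))

# Random 80/20 train/test split.
train = wide.sample(frac=0.8, random_state=1)
test = wide.drop(train.index)
print(wide)
```
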

Models

Our goal is to determine on which day of the week a given order will be placed. Since we have transformed the column order_dow into a factor with categorical values, we will apply models suited to a classification task.

We have chosen the models as follows:

  1. Decision Trees
  2. Random Forest
  3. Multinomial Logistic Regression
  4. Logistic Regression

In addition, we will apply some of the following approaches to each of the models:

  • One day of the week - Unbalanced data
  • One day of the week - Balanced data with Sub-sampling and Cross-Validation
  • Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

Decision Trees - Classification

Decision trees are algorithms that recursively search the feature space for the best possible split boundary until they are stopped (Ivo Bernardo, 2021). The basic idea is to split the data space into rectangles, evaluating each candidate split; the main goal is to minimize the impurity of each split relative to the previous one.
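A minimal illustration of such impurity-minimizing splits (Python/scikit-learn on simulated counts; the toy target is purely hypothetical and depends only on the first column, so a shallow tree recovers the boundary):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
# Toy features: counts of products for two departments over 300 orders.
X = rng.poisson(lam=3, size=(300, 2))
# Toy target: a day label tied to the "produce" count (column 0).
y = np.where(X[:, 0] >= 3, "Sunday", "Monday")

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)
# Each split is a threshold on one department's count, chosen to reduce impurity.
print(export_text(tree, feature_names=["produce", "frozen"]))
```
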

One day of the week - Unbalanced data

For this approach we want to measure the accuracy of the model on the unbalanced data. Furthermore, it will be interesting to see which departments the tree considers best for splitting the data into days of the week, to compare later with the balanced data with cross-validation (second approach).

According to the pruned tree, the produce department has the most relevance among the departments; this could be influenced by the fact that it has the highest number of products purchased in our data set. Furthermore, the tree shows that with a purchased amount of produce greater than or equal to 3, the model classifies the day of the week as Sunday; if it is lower than 3, the tree splits into another node on the frozen department.

The same procedure applies to this node and the following ones: each starts from the previous node and tries to minimize the impurity at each split. It should be noted that not all days of the week appear in the terminal nodes, because of the way trees are generated. For the same reason, we expect the predictions on the test set to contain counts of 0 for every day of the week other than Sunday and Monday.

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday         0      0        0      0        0       0         0
#>   Monday       694    706      578    650      708     685       661
#>   Saturday       0      0        0      0        0       0         0
#>   Sunday      2787   3228     3202   4843     2483    2538      2476
#>   Thursday       0      0        0      0        0       0         0
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday      0      0        0      0        0       0         0
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.211         
#>                  95% CI : (0.207, 0.216)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 0.2           
#>                                         
#>                   Kappa : 0.016         
#>                                         
#>  Mcnemar's Test P-Value : NA            
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                  0.000        0.1795           0.000
#> Specificity                  1.000        0.8217           1.000
#> Pos Pred Value                 NaN        0.1508             NaN
#> Neg Pred Value               0.867        0.8503           0.856
#> Prevalence                   0.133        0.1499           0.144
#> Detection Rate               0.000        0.0269           0.000
#> Detection Prevalence         0.000        0.1784           0.000
#> Balanced Accuracy            0.500        0.5006           0.500
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                  0.882           0.000          0.000
#> Specificity                  0.194           1.000          1.000
#> Pos Pred Value               0.225             NaN            NaN
#> Neg Pred Value               0.861           0.878          0.877
#> Prevalence                   0.209           0.122          0.123
#> Detection Rate               0.185           0.000          0.000
#> Detection Prevalence         0.822           0.000          0.000
#> Balanced Accuracy            0.538           0.500          0.500
#>                      Class: Wednesday
#> Sensitivity                      0.00
#> Specificity                      1.00
#> Pos Pred Value                    NaN
#> Neg Pred Value                   0.88
#> Prevalence                       0.12
#> Detection Rate                   0.00
#> Detection Prevalence             0.00
#> Balanced Accuracy                0.50

As expected, only Sunday and Monday receive predictions, while all the other days get zero. Overall, the accuracy of this model is low, with a score of 0.211; subtracting the 1/7 chance level (0.211 - 0.143 ≈ 0.068), the model is only about 7 percentage points better than random at classifying the days of the week. It is important to recall that the big difference between sensitivity and specificity arises because our data is not balanced.

One day of the week - Balanced data with Sub-sampling and Cross-Validation

For this approach, we balance the data with sub-sampling and make the overall score more robust by applying cross-validation to the model; this also helps us find the best set of hyperparameters.
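Sub-sampling (down-sampling) simply draws from every class as many observations as the rarest class has. A minimal sketch (Python/pandas on a toy unbalanced target; the report itself does this in R):

```python
import pandas as pd

# Toy unbalanced target: many more Sunday orders than the other days.
df = pd.DataFrame({"order_dow": ["Sunday"] * 50 + ["Monday"] * 20 + ["Tuesday"] * 10})

# Down-sample every class to the size of the smallest class.
n_min = df["order_dow"].value_counts().min()
balanced = df.groupby("order_dow").sample(n=n_min, random_state=1)
print(balanced["order_dow"].value_counts())
```
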

#>  .outcome  Fri Mon Sat Sun Thu Tue Wed                                    cover
#>  Saturday [.14 .13 .16 .14 .14 .14 .15] when produce <  3 & frozen >= 1     18%
#>    Sunday [.13 .15 .15 .18 .13 .13 .13] when produce >= 3                   44%
#>  Thursday [.15 .14 .13 .11 .16 .15 .16] when produce <  3 & frozen <  1     38%

The left column (.outcome) of each rule shows the day selected for that terminal node (the one with the highest probability), followed by the probability of each day of the week under the rule. In the last rule, Wednesday and Thursday seem to have the same probability because of rounding, but Thursday is 0.3% above Wednesday, as can be seen in the tree plot.

The rightmost column (cover) gives the percentage of observations falling under each rule. The first rule says that Saturday is chosen when the produce count is lower than 3 and the frozen count is greater than or equal to 1, and this rule covers 18% of the observations. We can then look at the results of the model in the confusion matrix.

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday         0      0        0      0        0       0         0
#>   Monday         0      0        0      0        0       0         0
#>   Saturday     678    645      751    904      595     587       583
#>   Sunday      1454   1797     1768   2992     1235    1291      1259
#>   Thursday    1349   1492     1261   1597     1361    1345      1295
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday      0      0        0      0        0       0         0
#> 
#> Overall Statistics
#>                                        
#>                Accuracy : 0.195        
#>                  95% CI : (0.19, 0.199)
#>     No Information Rate : 0.209        
#>     P-Value [Acc > NIR] : 1            
#>                                        
#>                   Kappa : 0.035        
#>                                        
#>  Mcnemar's Test P-Value : NA           
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                  0.000          0.00          0.1987
#> Specificity                  1.000          1.00          0.8223
#> Pos Pred Value                 NaN           NaN          0.1583
#> Neg Pred Value               0.867          0.85          0.8591
#> Prevalence                   0.133          0.15          0.1441
#> Detection Rate               0.000          0.00          0.0286
#> Detection Prevalence         0.000          0.00          0.1808
#> Balanced Accuracy            0.500          0.50          0.5105
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                  0.545          0.4265          0.000
#> Specificity                  0.576          0.6382          1.000
#> Pos Pred Value               0.254          0.1403            NaN
#> Neg Pred Value               0.827          0.8894          0.877
#> Prevalence                   0.209          0.1216          0.123
#> Detection Rate               0.114          0.0519          0.000
#> Detection Prevalence         0.450          0.3697          0.000
#> Balanced Accuracy            0.560          0.5324          0.500
#>                      Class: Wednesday
#> Sensitivity                      0.00
#> Specificity                      1.00
#> Pos Pred Value                    NaN
#> Neg Pred Value                   0.88
#> Prevalence                       0.12
#> Detection Rate                   0.00
#> Detection Prevalence             0.00
#> Balanced Accuracy                0.50

From the confusion matrix we observe a better balance between sensitivity and specificity across the classes. Comparing the previous model with this one, the sensitivity for the class Sunday changed from 0.882 to 0.545, and the specificity from 0.194 to 0.576. As expected, the accuracy decreased from 0.211 to 0.195, meaning the model is only about 5 percentage points above chance ((0.195 - (1/7)) = 0.052), but the balanced accuracy is better. This model would be preferred over the one trained on the unbalanced data.

Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

For the final approach we collapsed the levels of the column order_dow into two: one for the days during the week and one for the weekend. On top of that, we balanced the levels “weekday” and “weekend” and used cross-validation to train the model.
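A minimal sketch of this setup with the caret package (the data frame `orders_dep`, the outcome column `day_type` and the department count columns are illustrative names, not necessarily the ones used in our scripts):

```r
library(caret)

# 5-fold cross-validation; sampling = "down" balances the two levels
# ("weekday"/"weekend") inside each training fold by sub-sampling
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")

set.seed(123)
tree_fit <- train(day_type ~ ., data = orders_dep,
                  method = "rpart", trControl = ctrl)

confusionMatrix(predict(tree_fit, orders_dep), orders_dep$day_type)
```

Because `sampling = "down"` is applied inside `train`, only the training folds are re-balanced; the held-out folds keep the original class distribution, so the cross-validated metrics remain honest.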

We tried to plot the final tree computed by the model, but it was not possible to interpret because of the overlapping nodes in the graph. We could nevertheless see that the departments produce, frozen and meat.seafood were among the first splits.

The confusion matrix shows a good balance between sensitivity and specificity. The accuracy of the model is similar to that of the other two approaches: the model is ((0.538 - (1/2)) = 0.038) just under 4 percentage points above chance. Overall, all of the approaches score low at predicting the day of the week from the department composition of previous orders.

Random Forest

Random Forest (RF) is an ensemble algorithm that builds a set of decision trees and produces a final prediction by aggregating the outcomes of the individual trees (the user can define the number of trees and the number of variables tried at each split). One reason we decided to test this method is that RF is considered more stable than a single decision tree, and more trees generally mean better performance; these advantages come at a price, though: RF slows down computation and cannot easily be visualized. We will nevertheless look at the results for later comparison (Saikumar Talari, 2022).

Weekdays and Weekend - Balanced data with Sub-sampling and Cross-Validation

For this method we consider the same approach as the last classification tree: two classes, balanced by sub-sampling, with cross-validation. We faced computation speed problems while running the model, so we decided to keep only 10,000 orders to reduce the waiting time.
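A sketch of the random forest run under the same assumptions about object names as in the tree example (the 10,000-order subset is drawn at random):

```r
library(caret)

# keep only 10,000 orders to reduce training time
set.seed(123)
small <- orders_dep[sample(nrow(orders_dep), 10000), ]

ctrl <- trainControl(method = "cv", number = 5, sampling = "down")
rf_fit <- train(day_type ~ ., data = small,
                method = "rf",   # randomForest backend
                ntree = 100,     # number of trees in the ensemble
                trControl = ctrl)
```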

As expected, the accuracy of the model is higher than that of the classification tree, as are Cohen’s kappa and the balanced accuracy. This model would be preferred for predicting “weekday” versus “weekend”, as it gives better results.

Multinomial logistic regression

One day of the week - Unbalanced data

Multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. problems with more than two possible discrete outcomes (Wikipedia, 2021). Like binary logistic regression, it uses maximum likelihood estimation to evaluate the probability of categorical membership.

Our first approach is to predict the day of the week on which the order will be placed from the product composition of the order. Since there are 7 days in a week, this is not a binary but a multinomial logistic regression problem.

We select day 0 as the reference level. To build the model, we use the number of products from each department in the order as explanatory variables.
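A minimal sketch of this fit with `nnet::multinom` (the data frame `orders`, assumed to contain only the outcome and the department counts, is an illustrative name):

```r
library(nnet)

# make order_dow a factor and set day 0 as the reference level
orders$order_dow <- relevel(factor(orders$order_dow), ref = "0")

# multinomial logistic regression on the department counts
multi_fit <- multinom(order_dow ~ ., data = orders, maxit = 200)
pred <- predict(multi_fit, newdata = orders)
```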

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday       129     99       91     79       93      94       102
#>   Monday        69     87       39     54       95      89        85
#>   Saturday      39     46       50     51       36      29        24
#>   Sunday      3208   3673     3584   5278     2931    2981      2900
#>   Thursday      20     19        9     21       24      18        18
#>   Tuesday        0      0        0      0        0       0         0
#>   Wednesday     16     10        7     10       12      12         8
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.213         
#>                  95% CI : (0.208, 0.218)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 0.105         
#>                                         
#>                   Kappa : 0.01          
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                0.03706       0.02211         0.01323
#> Specificity                0.97548       0.98068         0.98998
#> Pos Pred Value             0.18777       0.16795         0.18182
#> Neg Pred Value             0.86882       0.85043         0.85634
#> Prevalence                 0.13267       0.14993         0.14406
#> Detection Rate             0.00492       0.00332         0.00191
#> Detection Prevalence       0.02618       0.01974         0.01048
#> Balanced Accuracy          0.50627       0.50140         0.50160
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                 0.9609        0.007521          0.000
#> Specificity                 0.0708        0.995444          1.000
#> Pos Pred Value              0.2149        0.186047            NaN
#> Neg Pred Value              0.8723        0.878705          0.877
#> Prevalence                  0.2093        0.121613          0.123
#> Detection Rate              0.2012        0.000915          0.000
#> Detection Prevalence        0.9358        0.004916          0.000
#> Balanced Accuracy           0.5158        0.501483          0.500
#>                      Class: Wednesday
#> Sensitivity                  0.002550
#> Specificity                  0.997100
#> Pos Pred Value               0.106667
#> Neg Pred Value               0.880408
#> Prevalence                   0.119555
#> Detection Rate               0.000305
#> Detection Prevalence         0.002858
#> Balanced Accuracy            0.499825

According to the confusion matrix, the accuracy (0.213) is low and there is a large gap between sensitivity and specificity in every class: for example, the sensitivity for class Friday is 0.037 while its specificity is 0.975. The kappa (0.01) is also very small, which means the observed accuracy is only slightly higher than what one would expect from a random model. We therefore try to balance the data and use cross-validation to improve the model.

One day of the week - Balanced data with Sub-sampling

Before balancing the data, we need to check the frequency of each class. The class Wednesday has the smallest frequency (12,550), so we balance the data by sub-sampling every class down to that frequency.

#> 
#>    Sunday    Friday    Monday  Saturday  Thursday   Tuesday 
#>     21972     13925     15738     15121     12768     12896 
#> Wednesday 
#>     12550

We tried cross-validation on top of the sub-sampled data using the train function of the caret package, but the data set is too big and it takes a very long time to run, so we decided not to include cross-validation.

#> 
#>    Sunday    Friday    Monday  Saturday  Thursday   Tuesday 
#>     12550     12550     12550     12550     12550     12550 
#> Wednesday 
#>     12550

We therefore only sub-sample the data, without cross-validation. Now every class has the same frequency (12,550).
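This kind of sub-sampling can be done with caret’s `downSample`, which draws from every class a random sample of the size of the smallest one (here Wednesday, 12,550); `orders` and `dept_cols` (a vector of department column names) are illustrative names:

```r
library(caret)

set.seed(123)
balanced <- downSample(x = orders[, dept_cols],
                       y = orders$order_dow,
                       yname = "order_dow")
table(balanced$order_dow)  # all seven classes now have the same count
```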

#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Friday Monday Saturday Sunday Thursday Tuesday Wednesday
#>   Friday       241    245      225    288      193     196       203
#>   Monday       334    451      384    575      297     335       322
#>   Saturday     363    394      434    611      321     305       295
#>   Sunday       862   1065     1097   1959      692     766       723
#>   Thursday     798    834      755    948      758     752       740
#>   Tuesday      311    345      330    414      306     275       273
#>   Wednesday    572    600      555    698      624     594       581
#> 
#> Overall Statistics
#>                                         
#>                Accuracy : 0.179         
#>                  95% CI : (0.174, 0.184)
#>     No Information Rate : 0.209         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.033         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#> 
#> Statistics by Class:
#> 
#>                      Class: Friday Class: Monday Class: Saturday
#> Sensitivity                0.06923        0.1146          0.1148
#> Specificity                0.94068        0.8993          0.8981
#> Pos Pred Value             0.15148        0.1672          0.1594
#> Neg Pred Value             0.86855        0.8520          0.8577
#> Prevalence                 0.13267        0.1499          0.1441
#> Detection Rate             0.00918        0.0172          0.0165
#> Detection Prevalence       0.06063        0.1028          0.1038
#> Balanced Accuracy          0.50496        0.5070          0.5064
#>                      Class: Sunday Class: Thursday Class: Tuesday
#> Sensitivity                 0.3566          0.2375         0.0853
#> Specificity                 0.7491          0.7906         0.9140
#> Pos Pred Value              0.2735          0.1357         0.1220
#> Neg Pred Value              0.8147          0.8822         0.8771
#> Prevalence                  0.2093          0.1216         0.1228
#> Detection Rate              0.0747          0.0289         0.0105
#> Detection Prevalence        0.2730          0.2129         0.0859
#> Balanced Accuracy           0.5529          0.5141         0.4997
#>                      Class: Wednesday
#> Sensitivity                    0.1852
#> Specificity                    0.8423
#> Pos Pred Value                 0.1375
#> Neg Pred Value                 0.8839
#> Prevalence                     0.1196
#> Detection Rate                 0.0221
#> Detection Prevalence           0.1610
#> Balanced Accuracy              0.5138

From the confusion matrix report we notice an improvement in the gap between sensitivity and specificity for each class. For example, the sensitivity and specificity for class Monday were 0.022 and 0.981 in the previous model; after balancing the data they are 0.115 and 0.899. The kappa is also higher (from 0.01 to 0.033).

Logistic regression

Weekdays and Weekend - Balanced data and Cross-Validation

Logistic regression is a regression method adapted to binary classification. The basic idea is to reuse the mechanism developed for linear regression by modeling the probability p_i with a linear predictor function, i.e. a linear combination of the explanatory variables and a set of regression coefficients that are specific to the model at hand but the same for all observations. The linear combination is then transformed into a probability using the sigmoid function.
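In symbols, the sigmoid maps the linear predictor eta_i = b0 + b'x_i to the probability p_i = 1 / (1 + exp(-eta_i)):

```r
sigmoid <- function(eta) 1 / (1 + exp(-eta))
sigmoid(0)  # 0.5: a linear predictor of zero gives a probability of one half
```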

To further improve our model quality, we aggregate the day-of-week classes. Buying behavior usually differs between weekdays and the weekend, so we separate the day of week into two classes: weekday and weekend.

Now the outcome variable has only two categories, so we can use binomial logistic regression.
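A sketch of the recoding and the binomial fit; the `orders` data frame is an illustrative name, and which numeric codes of order_dow correspond to the weekend is an assumption here:

```r
# assumption: codes 0 and 6 of order_dow are the two weekend days
orders$day_type <- factor(ifelse(orders$order_dow %in% c(0, 6),
                                 "weekend", "weekday"))

# binomial logistic regression on the department counts
logit_fit <- glm(day_type ~ . - order_dow, data = orders, family = binomial)
```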

#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction weekday weekend
#>    weekday    3245    1445
#>    weekend    1972    1339
#>                                         
#>                Accuracy : 0.573         
#>                  95% CI : (0.562, 0.584)
#>     No Information Rate : 0.652         
#>     P-Value [Acc > NIR] : 1             
#>                                         
#>                   Kappa : 0.099         
#>                                         
#>  Mcnemar's Test P-Value : <2e-16        
#>                                         
#>             Sensitivity : 0.622         
#>             Specificity : 0.481         
#>          Pos Pred Value : 0.692         
#>          Neg Pred Value : 0.404         
#>              Prevalence : 0.652         
#>          Detection Rate : 0.406         
#>    Detection Prevalence : 0.586         
#>       Balanced Accuracy : 0.551         
#>                                         
#>        'Positive' Class : weekday       
#> 

According to the confusion matrix, the balanced accuracy is higher and the gap between sensitivity (0.622) and specificity (0.481) is even smaller. The kappa is now 0.099, higher than the Cohen’s kappa of the previous model (0.033), and the accuracy is 0.573. Comparing this model against the previous ones, the random forest model is the one whose results it most resembles: logistic regression is higher only on accuracy, by 0.011, and on Cohen’s kappa, by 0.002, with almost the same balanced accuracy.

Variable Importance

Conclusions